feat: generalized wildcard queries #590

Taepper · 2024-09-19T15:51:08Z

resolves #458

Summary

This PR implements the generalized lineage refactor. Until now, the system only worked with lineages that follow the Pango lineage format, because it was designed for SARS-CoV-2.

Lineage handling is now generalized: the system receives an arbitrary lineage definition file that describes the parent-child relations of the provided lineages.
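As a rough illustration of what such a definition encodes, the sketch below models the definitions as a mapping from each lineage to its parents and collects all descendants of a lineage, which is the set a wildcard query has to cover. The dictionary layout, the sample lineages, and the `descendants` helper are assumptions for illustration, not SILO's actual file schema or implementation.

```python
from collections import defaultdict

# Hypothetical, already-parsed lineage definitions: lineage -> list of parents.
# The real file schema may differ; only the parent-child idea is from the PR.
LINEAGE_DEFINITIONS = {
    "A": [],
    "A.1": ["A"],
    "A.1.1": ["A.1"],
    "A.2": ["A"],
    "B": [],
}


def descendants(definitions, root):
    """Return root plus every lineage reachable through child links."""
    # Invert the parent lists into a parent -> children lookup.
    children = defaultdict(list)
    for lineage, parents in definitions.items():
        for parent in parents:
            children[parent].append(lineage)

    # Depth-first traversal; the set also guards against revisiting
    # lineages that have more than one parent.
    found = {root}
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


print(sorted(descendants(LINEAGE_DEFINITIONS, "A")))
```

Because each lineage lists its parents explicitly, this approach does not depend on the dotted naming scheme at all, which is the point of moving away from a Pango-specific format.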

Breaking Changes:

The preprocessing config field pangoLineageDefinitionFilename has been renamed to lineageDefinitionFilename.
We now take a YAML lineage definition file instead of a Pango alias key. The provided script scripts/alias2lineageDefinitions.py transforms a list of lineages and an alias key into the required file format.
We now validate on input and on query whether the provided lineage is among the defined lineages, and throw an error if it is not.

In the current version, we do not allow alias queries. This means that a user-provided query must contain a lineage name exactly as it is defined for the respective instance, e.g. a valid Pango lineage. Searching for B.1.1.529.1 does not work; a search for BA.1 is required instead.
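A minimal sketch of the validation behavior described above, assuming lineages are matched by their exact defined name. `UnknownLineageError`, `validate_lineage`, and the sample set of lineages are hypothetical names chosen for illustration; the actual implementation lives in the C++ sources of this PR.

```python
class UnknownLineageError(ValueError):
    """Raised when a lineage is not part of the lineage definitions."""


def validate_lineage(defined_lineages, lineage):
    """Return the lineage unchanged if it is defined, else raise."""
    if lineage not in defined_lineages:
        raise UnknownLineageError(f"The lineage '{lineage}' is not defined")
    return lineage


# Hypothetical set of defined lineages, using the names from the example above.
defined = {"B.1.1.529", "BA.1", "BA.2"}

validate_lineage(defined, "BA.1")  # accepted: the exact name is defined

try:
    validate_lineage(defined, "B.1.1.529.1")  # unaliased form is not resolved
except UnknownLineageError as error:
    print(error)
```

The same check would apply both when data is inserted and when a query is parsed, so unknown names fail early instead of silently matching nothing.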

PR Checklist

All necessary documentation has been adapted or there is an issue to do so.
The implemented feature is covered by an appropriate test.

github-actions · 2024-09-19T15:51:33Z

This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog.

Note that this might contain changes that are on main, but not yet released.

Changelog:

0.3.0 (2024-10-16)

⚠ BREAKING CHANGES

generalized wildcard queries (#458)

Features

generalized wildcard queries (#458) (80363de)

Bug Fixes

correctly escape quotes in field names (7e7b448)
resolve aliases when inserting to or querying lineage indexes again (04fd1e0)
update script to also generate aliases (c19bef9)

Taepper · 2024-09-19T15:53:22Z

By the way, the majority of the line count (12634 lines) is the new, more verbose lineage definition file: testBaseData/exampleDataset/lineage_definitions.yaml

fengelniederhammer

The commit message should reference the issue (ideally with resolves #... in the footer, so that release-please automatically picks it up and mentions it in the changelog)

Also this is a breaking change - the commit message should mention it.

Also I wonder whether it might be good to still have a metadata type lineage or similar? Would there be any advantages?

include/silo/common/bidirectional_map.h

testBaseData/samFiles/database_config.yaml

testBaseData/exampleDataset/small_metadata_set.tsv

src/silo_api/api.cpp

src/silo/config/util/config_repository.cpp

src/silo/database.cpp

fengelniederhammer · 2024-09-25T08:18:21Z

scripts/alias2lineageDefinitions.py

I find this quite complicated. Wouldn't it be easier to do it in plain Python?

You get better IDE support compared to when writing SQL

We would not need the DuckDB dependency in Python

ChatGPT gave me this, which almost works:

```python
import json
import argparse
import yaml
import shutil
import sys
import os
import tempfile

# Define the argument parser
parser = argparse.ArgumentParser(description="Process a JSON file and convert it to a DataFrame.")
parser.add_argument('alias_key', type=str, help='Path to the alias_key in JSON format')
parser.add_argument('lineage_file', type=str, help='Path to the input_file containing all lineages')
parser.add_argument('--preserve-tmp-dir', action='store_true', help='Preserve the temporary directory to keep the intermediate files')

# Parse the arguments
args = parser.parse_args()

# Load the JSON data from the file path provided as argument
with open(args.alias_key, 'r') as file:
    alias_key = json.load(file)

# Create a list to store the reformatted data
reformatted_data = []

# Loop through the JSON and format it as desired
for key, value in alias_key.items():
    if value and isinstance(value, str):
        reformatted_data.append({"name": key, "alias": value})
    else:
        reformatted_data.append({"name": key, "alias": key})

# Load the lineage data from the file path provided as argument
with open(args.lineage_file, 'r') as file:
    lineages = file.read().splitlines()

# Create a dictionary to store the alias mappings
alias_dict = {item['name']: item['alias'] for item in reformatted_data}

# Function to unalias a lineage
def unalias_lineage(lineage):
    parts = lineage.split('.')
    if parts[0] in alias_dict:
        parts[0] = alias_dict[parts[0]]
    return '.'.join(parts)

# Unalias all lineages
unaliased_lineages = {lineage: unalias_lineage(lineage) for lineage in lineages}

# Function to find the immediate parent lineage
def find_immediate_parent(lineage):
    if '.' in lineage:
        return lineage.rsplit('.', 1)[0]
    return None

# Create the lineage definitions
lineage_definitions = {}
for lineage, unaliased in unaliased_lineages.items():
    parent = find_immediate_parent(unaliased)
    if parent:
        parent_unaliased = unalias_lineage(parent)
        lineage_definitions[lineage] = {"parents": [parent_unaliased]}
    else:
        lineage_definitions[lineage] = {"parents": []}

# Transform the data into YAML
yaml.dump(lineage_definitions, sys.stdout, default_flow_style=False)

if args.preserve_tmp_dir:
    temp_dir = tempfile.mkdtemp()
    print(f"Temporary directory: {temp_dir}", file=sys.stderr)
else:
    temp_dir = tempfile.mkdtemp()
    shutil.rmtree(temp_dir)
```

I think that the SQL based script is less error-prone but effectively equivalent. I really like that I can just look at the intermediate tables and see the precise output of every single step in the script. What would the finished alternative look like?

I don't like that I need to look at SQL statements to understand the code. Probably I'm just a lot less used to SQL :D

The finished script would look quite similar I guess. It resolves an alias in a couple places where it should print the alias instead (A.1.2.3.4 instead of AA.4). I assume it's easy to fix.

I did a hybrid and moved the ugly SQL part into python :)

The remaining part would be much more error-prone (implementing the first part already demonstrated that to me :D) and a lot more verbose in my opinion
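The alias issue raised in this thread (printing A.1.2.3.4 where AA.4 is expected) can be illustrated with a small sketch. It assumes an alias key that maps each alias prefix to a single unaliased string; `ALIAS_KEY`, `unalias`, `realias`, and `parent_name` are hypothetical names, and this is not the code that ended up in scripts/alias2lineageDefinitions.py.

```python
# Hypothetical alias key: alias prefix -> unaliased prefix.
ALIAS_KEY = {"AA": "A.1.2.3", "BA": "B.1.1.529"}


def unalias(lineage, alias_key):
    """Expand the leading alias of a lineage name, if it has one."""
    head, _, rest = lineage.partition(".")
    expanded = alias_key.get(head, head)
    return f"{expanded}.{rest}" if rest else expanded


def realias(unaliased, alias_key):
    """Compress the longest proper prefix that has an alias."""
    parts = unaliased.split(".")
    reverse = {full: alias for alias, full in alias_key.items()}
    # Try the longest proper prefix first, so the name itself is never replaced.
    for cut in range(len(parts) - 1, 0, -1):
        prefix = ".".join(parts[:cut])
        if prefix in reverse:
            return ".".join([reverse[prefix]] + parts[cut:])
    return unaliased


def parent_name(lineage, alias_key):
    """Return the parent in aliased form, or None for a root lineage."""
    unaliased = unalias(lineage, alias_key)
    if "." not in unaliased:
        return None
    parent = unaliased.rsplit(".", 1)[0]
    return realias(parent, alias_key)


print(parent_name("AA.4.1", ALIAS_KEY))  # AA.4
print(parent_name("BA.1", ALIAS_KEY))    # B.1.1.529
```

The parent is computed on the unaliased name, because that is where the hierarchy is visible, and is then converted back so that it matches the name under which the parent itself is defined.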

scripts/generate_new_lineage_definitions.bash

fengelniederhammer

Actually looks good, but it's not quite ready to be merged yet.

Also please make sure that the changelog contains the issue.

src/silo/storage/reference_genomes.test.cpp

src/silo/storage/lineage_index.cpp

src/silo/database.cpp

src/silo/config/util/config_repository.test.cpp

include/silo/storage/column/indexed_string_column.h

src/silo/common/lineage_tree.cpp

scripts/generate_new_lineage_definitions.bash

src/silo/config/database_config.cpp

fengelniederhammer · 2024-10-09T15:42:34Z

I created GenSpectrum/LAPIS#978 to update the documentation.

BREAKING CHANGES: The preprocessing config field pangoLineageDefinitionFilename has been renamed to lineageDefinitionFilename. We now accept a YAML lineage definition file instead of a Pango alias key. A script (scripts/alias2lineageDefinitions.py) is provided to transform a list of lineages and an alias into the required file format. Input and query validation now checks whether the provided lineage exists in the defined lineages, and errors are thrown if validation fails.

resolves #458

Taepper marked this pull request as draft September 19, 2024 15:51

Taepper force-pushed the generalized-lineage branch 3 times, most recently from 886c2aa to 7e94a82 Compare September 23, 2024 05:44

Taepper requested a review from fengelniederhammer September 23, 2024 05:44

Taepper marked this pull request as ready for review September 23, 2024 05:44

fengelniederhammer requested changes Sep 23, 2024

fengelniederhammer reviewed Sep 25, 2024

fengelniederhammer reviewed Sep 26, 2024

scripts/generate_new_lineage_definitions.bash (outdated)

Taepper force-pushed the generalized-lineage branch 2 times, most recently from 406b12f to 561327d Compare October 4, 2024 15:36

Taepper changed the title from "feat: generalized wildcard queries, first finished draft" to "feat: generalized wildcard queries" Oct 4, 2024

Taepper force-pushed the generalized-lineage branch 8 times, most recently from 94f714f to 93733a8 Compare October 7, 2024 09:27

Taepper requested a review from fengelniederhammer October 7, 2024 09:28

Taepper force-pushed the generalized-lineage branch 3 times, most recently from ff6cb04 to c1cb660 Compare October 7, 2024 14:46

fengelniederhammer requested changes Oct 8, 2024

Taepper force-pushed the generalized-lineage branch 2 times, most recently from f3b1267 to 422d2e2 Compare October 8, 2024 14:31

Taepper requested a review from fengelniederhammer October 8, 2024 14:31

Taepper force-pushed the generalized-lineage branch 5 times, most recently from 1e2220b to b9a24c7 Compare October 9, 2024 07:14

fengelniederhammer reviewed Oct 9, 2024

src/silo/config/database_config.cpp (outdated)

Taepper force-pushed the generalized-lineage branch from 8222885 to b9a24c7 Compare October 9, 2024 14:13

fengelniederhammer mentioned this pull request Oct 9, 2024

feat(lapis): adapt to new generalized lineage system GenSpectrum/LAPIS#977

Merged

1 task

Taepper force-pushed the generalized-lineage branch from b9a24c7 to 71120a7 Compare October 14, 2024 15:28

Taepper added 5 commits October 16, 2024 10:42

fix: update script to also generate aliases

c19bef9

fix: resolve aliases when inserting to or querying lineage indexes again

04fd1e0

test: remove tsv tests from matrix e2e tests and rename ndjson and tsv directories

f97d30e

chore: remove helper script, which generates lineage definitions file

1d9f6f7

Taepper force-pushed the generalized-lineage branch 2 times, most recently from 5807476 to 0336600 Compare October 16, 2024 08:44

refactor: rename lineage_index config to generate_lineage_index

0b2a670

Taepper force-pushed the generalized-lineage branch from 0336600 to 0b2a670 Compare October 16, 2024 09:13

Taepper merged commit 277eb68 into main Oct 16, 2024
7 checks passed

Taepper deleted the generalized-lineage branch October 16, 2024 09:13
